Citation:
Baltimore Police Department (2016). Calls for Service [Calls_for_Service.csv]. Retrieved 10/10/2016 from https://data.baltimorecity.gov/Public-Safety/Calls-for-Service/xviu-ezkt
License:
This data set is released under the Creative Commons Attribution 3.0 Unported License. Under this license, users may use, share, and adapt the data without restriction, provided they give appropriate credit to the dataset owner, indicate any changes made to the original data, and include a link to the license. License Link: https://creativecommons.org/licenses/by/3.0/legalcode
Metadata:
This data set contains a list of all the 911 emergency calls made in the city of Baltimore from the year 2015.
The dataset contains no metadata beyond the column names, so we used our best judgment to interpret what each column represents.
Here are the exact column names, and our interpretation of them:
Issues encountered with the data
We did not remediate every data issue we found, only those that pertained to the questions we wanted to answer, so as not to modify the data set unnecessarily.
We downloaded the Baltimore Emergency Calls data set from https://data.baltimorecity.gov/Public-Safety/Calls-for-Service/xviu-ezkt
The original data set consists of almost 2 million records. We downloaded the data in CSV format, then imported the data into R.
We decided to use a random sample of 50,000 rows for analysis of the first and second question. One team member took a random sample, and distributed the same random sample to the rest of the team, so that we would all have the exact same sample.
Step by step instructions:
##Load the data into R
mydata=read.csv('Final Subset Data CSV.csv')
##Take a random sample of 50,000 rows
subsetcalls <- mydata[sample(1:nrow(mydata), 50000, replace=FALSE),]
##Convert to Excel format, so that the random sample can be distributed to the team members
##install xlsx package
library(xlsx)
write.xlsx(x = subsetcalls, file = "test.excelfile.xlsx", sheetName = "TestSheet", row.names = FALSE)
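A note on reproducibility: as written, `sample()` produces a different subset on every run, which is why the file had to be shared. A sketch of an alternative, fixing the RNG seed (the seed value here is arbitrary) so each team member can regenerate an identical subset, shown on a toy data frame:

```r
# Toy stand-in for the full dataset (the real data frame has ~2 million rows)
mydata <- data.frame(CallNumber = 1:1000, Priority = "Low")

set.seed(2016)  # any agreed-upon seed works; share the seed instead of a file
subsetcalls <- mydata[sample(1:nrow(mydata), 50, replace = FALSE), ]

# Re-running the two lines above with the same seed reproduces the same rows
```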
| Index | CallDateTime | Priority | District | Description | CallNumber | IncidentLocation | Location |
|---|---|---|---|---|---|---|---|
| 44699 | 0:49:00 | High | ED | & 99 | P153060067 | HARFORD RD/CLIFTVIEW AV | (39.315795,-76.596304) |
| 11175 | 11:39:00 | Low | SE | **SERV OF A WAR | 150541007 | 1800 ORLEANS ST | (39.2961670,-76.5919375) |
| 475 | 2:04:00 | Medium | NE | * | 150040299 | 5000 PLYMOUTH RD | (39.3514250,-76.5742210) |
| 42673 | 16:06:00 | Medium | WD | *` | 150131865 | 0 N FULTON AV | (39.2882849,-76.6452158) |
| 23032 | 1:08:00 | High | ED | *2 MALE | 150460163 | 1000 E PRESTON ST | (39.3052050,-76.6035010) |
Question 1: Are more 911 calls related to burglary made during the night? Or is that just a preconceived notion?
To answer this, I first need to perform data cleaning on the ‘Description’ and ‘CallDateTime’ fields.
Data Cleaning
##---CallDateTime field---
## 1. This question needs the time and not the date, so remove the dates using Excel
## Click on the column header so that the entire column and all its rows are highlighted.
## Right click and select 'Format Cells'.
## In the pop up that opens, go to the 'Number tab', and select the Category as 'Time'.
## Select the time format '13:30:55' and click on 'Ok'. Now the date values should disappear
## 2. Read data into R
subsetcalls=read.csv('Final Subset Data CSV.csv')
## 3. Convert into 'Time' format
## Install chron package
library(chron)
subsetcalls$CallDateTime=chron(times=subsetcalls$CallDateTime)
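The Excel reformatting in step 1 can also be done entirely in R, which keeps the pipeline scriptable. A sketch on toy values, assuming the raw CallDateTime strings follow the mm/dd/yyyy hh:mm pattern seen in the data; `chron(times = time_only)` would then convert the result as in step 3:

```r
# Toy raw values standing in for the CallDateTime column
raw <- c("01/05/2015 11:39", "02/10/2015 00:49")

# Parse the full timestamp, then keep only the time-of-day portion
parsed    <- as.POSIXct(raw, format = "%m/%d/%Y %H:%M")
time_only <- format(parsed, "%H:%M:%S")
```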
##---Description field---
## 1. Look through all the distinct 'Description' values and identify the ones related to robberies
## Install and load sqldf - so that SQL commands can be used in R
library(sqldf)
allrobbery<-sqldf('SELECT DISTINCT subsetcalls.Description FROM subsetcalls')
allrobbery<-sqldf('SELECT DISTINCT subsetcalls.Description, COUNT(Description) FROM subsetcalls WHERE Description in ("ATTEMP BURGLARY", "BURGLARPY", "Burglary", "BURGLARY", "POSS BURGLARY", "POSS STOLEN", "Robbery Armed", "ROBBERY ARMED", "Robbery Unarmed", "ROBBERY UNARMED", "ATTEMPT THEFT") GROUP BY Description')
## 2. Modify all the 'Description' values related to robberies to read "BURGLARY"
## Convert from factor to character so the labels can be overwritten in place
subsetcalls$Description <- as.character(subsetcalls$Description)
burglary_labels <- c("ATTEMP BURGLARY", "BURGLARPY", "Burglary", "POSS BURGLARY",
                     "POSS STOLEN", "Robbery Armed", "ROBBERY ARMED",
                     "Robbery Unarmed", "ROBBERY UNARMED", "ATTEMPT THEFT")
subsetcalls$Description[subsetcalls$Description %in% burglary_labels] <- "BURGLARY"
## 3. The final modified subset for this particular question, after data cleaning
finalburglary<-sqldf('SELECT subsetcalls.Description, subsetcalls.CallDateTime FROM subsetcalls WHERE Description = "BURGLARY" order by subsetcalls.CallDateTime')
To perform analysis, I need to categorize the times as Morning, Afternoon, Evening, Night
## 1. Create a new column called 'Time_Of_Day', and enter the default value as 'Night'
finalburglary["Time_Of_Day"]<- 'Night'
## 2. Label the hours from 5AM-11AM as Morning, 11AM-5PM as Daytime, and 5PM-11PM as Evening
finalburglary$Time_Of_Day[finalburglary$CallDateTime > times('05:00:00') & finalburglary$CallDateTime <= times('11:00:00')] <- 'AMMorning'
finalburglary$Time_Of_Day[finalburglary$CallDateTime > times('11:00:00') & finalburglary$CallDateTime <= times('17:00:00')] <- 'Daytime'
finalburglary$Time_Of_Day[finalburglary$CallDateTime > times('17:00:00') & finalburglary$CallDateTime <= times('23:00:00')] <- 'Evening'
## Morning is entered as "AMMorning" so that it sorts first of the four categories (the barplot displays them alphabetically); the label is changed back to "Morning" in the plot itself.
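A sketch of an alternative binning that sidesteps the alphabetical-ordering workaround: bin by hour with `findInterval()`, then fix the display order through factor levels. Toy hours are used for illustration, and the boundary handling differs slightly from the comparisons above (a call at exactly 05:00 falls in Morning here):

```r
# Hour-of-day for a few toy calls (chron's hours() would supply these)
h <- c(3, 5, 14, 20, 23)

# Intervals: [0,5) Night, [5,11) Morning, [11,17) Daytime, [17,23) Evening, [23,24) Night
bins <- c("Night", "Morning", "Daytime", "Evening", "Night")
time_of_day <- bins[findInterval(h, c(0, 5, 11, 17, 23))]

# Factor levels control the plotting order, so no 'AMMorning' trick is needed
time_of_day <- factor(time_of_day, levels = c("Morning", "Daytime", "Evening", "Night"))
```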
Plot the data
## Obtain the counts of each category in the form of a table; this helps with plotting the data
burglarycounts<- table(finalburglary$Time_Of_Day)
## Used a barplot. Set the X and Y axis labels, the column names, and the colors.
barplot(burglarycounts, main = "Burglaries Count by Time of Day", xlab = "Time of Day", ylab = "Number of Burglaries", ylim = c(0, 600), col = c("lightseagreen", "brown1", "chartreuse3", "darkorchid1"), names.arg=c("Morning", "Day Time", "Evening", "Night"))
## Display the burglary count and timing for each category
text(0.7, 100, "5AM-11AM")
text(1.9, 100, "11AM-5PM")
text(3.1, 100, "5PM-11PM")
text(4.3, 100, "11PM-5AM")
text(0.7, 270, "234")
text(1.9, 470, "428")
text(3.1, 470, "432")
text(4.3, 270, "228")
Question 2: Does the number of blank calls increase or decrease with the arrival of the weekend?
To answer this, I first need to perform data cleaning on the ‘Description’ and ‘CallDateTime’ fields.
Data Cleaning
#---Standardize the Date column in Excel---
# Click on the column header so that the entire column and all its rows are highlighted.
# Right click and select 'Format Cells'.
# Change all the values to the format 'mm/dd/yyyy hh:mm'
# Load the data into R
subsetcalls = read.csv("Final Subset Data Original CSV 1.csv")
#---Description field---
#selecting all distinct call descriptions that contain 'Hangup' and storing them in a table
distinctDescriptionValues <- sqldf('SELECT DISTINCT subsetcalls.Description FROM subsetcalls
WHERE Description like "%Hangup%"')
#selecting the various types of no-voice calls and their corresponding counts and storing them in a table
allblankcalls=sqldf('SELECT DISTINCT subsetcalls.Description, COUNT(Description)
from subsetcalls WHERE Description in
("911/NO VOICE","911/No Voice")
GROUP BY Description')
#selecting the various types of hangup calls and their corresponding counts and storing them in a table
allblankcalls1=sqldf('SELECT DISTINCT subsetcalls.Description, COUNT(Description)
from subsetcalls WHERE Description in
("911/HANGUP","911/HANGUP.","911 HANGUP")
GROUP BY Description')
#converting the column description from factor to character
subsetcalls$Description=as.character(subsetcalls$Description)
#replacing all variants of no voice with a single NO_VOICE label
subsetcalls$Description[subsetcalls$Description %in% c("911/NO VOICE", "911/No Voice")]="NO_VOICE"
#replacing all variants of hangup with a single HANGUP label
subsetcalls$Description[subsetcalls$Description %in% c("911/HANGUP", "911/HANGUP.", "911 HANGUP")]="HANGUP"
#converting the description column back to factor
subsetcalls$Description=as.factor(subsetcalls$Description)
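The factor-to-character round-trip above can be avoided by reading strings as plain characters in the first place (the default in R 4.x; in older R, pass `stringsAsFactors = FALSE` to `read.csv`). A sketch using an inline CSV in place of the downloaded file:

```r
# Inline CSV standing in for the downloaded file
csv_text <- "Description\n911/NO VOICE\n911/HANGUP"
toy <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Description is already character, so relabelling works without any conversion
toy$Description[toy$Description == "911/NO VOICE"] <- "NO_VOICE"
```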
#checking the converted values
allblankcalls=sqldf('SELECT DISTINCT subsetcalls.Description, COUNT(Description)
from subsetcalls WHERE Description in
("NO_VOICE")
GROUP BY Description')
allblankcalls1=sqldf('SELECT DISTINCT subsetcalls.Description, COUNT(Description)
from subsetcalls WHERE Description in
("HANGUP")
GROUP BY Description')
allblankcalls
## Description COUNT(Description)
## 1 NO_VOICE 9411
allblankcalls1
## Description COUNT(Description)
## 1 HANGUP 85
#---CallDateTime field---
#using library timeDate. Install timeDate package
library(timeDate)
#extracted the date portion of the CallDateTime column into a new dates column
subsetcalls$dates=format(as.POSIXct(subsetcalls$CallDateTime,format='%m/%d/%Y'),format='%m/%d/%Y')
#converted dates column into date format (was in factor format)
subsetcalls$dates=as.timeDate(subsetcalls$dates)
#adding a column called days that converts all the dates into days
subsetcalls$days=dayOfWeek(subsetcalls$dates)
#based on the days column - creating the weekend column that flags all the weekends (SAT & SUN) as TRUE
subsetcalls$weekend=isWeekend(subsetcalls$dates)
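The same day-of-week and weekend flags can also be derived with base R alone, should the timeDate dependency ever be unwanted. A sketch on toy dates, using ISO weekday numbers (`%u`, where 6 and 7 are Saturday and Sunday):

```r
# Toy dates standing in for the cleaned dates column
d <- as.Date(c("01/03/2015", "01/05/2015"), format = "%m/%d/%Y")  # a Saturday, a Monday

days    <- format(d, "%a")                    # abbreviated day names (locale-dependent)
weekend <- format(d, "%u") %in% c("6", "7")   # TRUE for Saturday/Sunday
```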
#extracted the time portion of the CallDateTime column into a new time column
subsetcalls$time=format(as.POSIXct(subsetcalls$CallDateTime, format="%m/%d/%Y %H:%M"), format="%H:%M:%S")
#use the chron library to convert into time format. Install the chron package
library(chron)
#this converts the time into times format and puts it into the time2 column
subsetcalls$time2<- chron(times=subsetcalls$time)
#this adds an extra column that also flags Friday as part of the weekend
subsetcalls$weekend1=isWeekend(subsetcalls$dates,1:4)
#This makes a subset of all rows that correspond to the weekend,
#with the extra condition that Friday from 7 PM onward is also included
abc1=subset(subsetcalls, subsetcalls$weekend | (subsetcalls$days=="Fri" & subsetcalls$time2 >= times("19:00:00")))
#this makes a subset of all the weekday rows, up to Friday 7 PM
abc2=subset(subsetcalls, !subsetcalls$weekend1 | (subsetcalls$days=="Fri" & subsetcalls$time2 < times("19:00:00")))
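The pair of subsets can also be expressed as a single labelled column, which keeps every row in one frame for later grouping. A sketch on toy rows (hours are plain numbers here for simplicity; the 7 PM cutoff mirrors the condition above):

```r
# Toy rows standing in for subsetcalls
toy <- data.frame(days    = c("Fri", "Fri", "Sat", "Mon"),
                  hour    = c(18, 20, 10, 9),
                  weekend = c(FALSE, FALSE, TRUE, FALSE))

# Friday 7 PM onward counts as weekend, matching the subsetting condition above
toy$period <- ifelse(toy$weekend | (toy$days == "Fri" & toy$hour >= 19),
                     "weekend", "weekday")
```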
#WITH FRIDAY EVENING AS PART OF THE WEEKEND
#the dataset of NO_VOICE calls on the weekend (Friday evening included)
weekend_1=subset(abc1,abc1$Description=="NO_VOICE")
#the dataset of HANGUP calls on the weekend (Friday evening included)
weekend_2=subset(abc1,abc1$Description=="HANGUP")
#the dataset of NO_VOICE calls on weekdays (Friday evening excluded)
weekday_1=subset(abc2,abc2$Description=="NO_VOICE")
#the dataset of HANGUP calls on weekdays (Friday evening excluded)
weekday_2=subset(abc2,abc2$Description=="HANGUP")
#WITH FRIDAY AS AN ORDINARY WEEKDAY
#the dataset of NO_VOICE calls on the weekend (Saturday and Sunday only)
weekend_3=subset(subsetcalls, subsetcalls$weekend & subsetcalls$Description == "NO_VOICE")
#the dataset of HANGUP calls on the weekend (Saturday and Sunday only)
weekend_4=subset(subsetcalls, subsetcalls$weekend & subsetcalls$Description == "HANGUP")
#the dataset of NO_VOICE calls on weekdays (Friday included)
weekday_3=subset(subsetcalls, !subsetcalls$weekend & subsetcalls$Description == "NO_VOICE")
#the dataset of HANGUP calls on weekdays (Friday included)
weekday_4=subset(subsetcalls, !subsetcalls$weekend & subsetcalls$Description == "HANGUP")
Plot the data
#Plotting Graphs
#bar plots
#converting days column from character to factor
subsetcalls$days=as.factor(subsetcalls$days)
#forming a subset for overall number of no_voice and hangup calls
subsetcalls1=subset(subsetcalls, subsetcalls$Description=="NO_VOICE")
subsetcalls2=subset(subsetcalls, subsetcalls$Description=="HANGUP")
#printing out the summary of the days column - this gives us the number of calls for each day
summary(subsetcalls1$days)
## Fri Mon Sat Sun Thu Tue Wed
## 1398 1365 1287 1276 1316 1328 1441
summary(subsetcalls2$days)
## Fri Mon Sat Sun Thu Tue Wed
## 15 5 18 10 14 13 10
#saving the daily counts into vectors, reordered Monday through Sunday
#(note that summary() prints the days alphabetically: Fri Mon Sat Sun Thu Tue Wed)
no_voice=c(1365,1328,1441,1316,1398,1287,1276)
hangup=c(5,13,10,14,15,18,10)
#plotting number of no_voice calls daywise
barplot(no_voice, main="NO VOICE CALLS", beside=TRUE,
col=terrain.colors(7),ylim = c(0,1600),xlab = "DAYS - Monday through Sunday",ylab = "Number of NO_VOICE Calls",col.lab="Blue",col.main="Red")
#plotting the number of hangup calls daywise
barplot(hangup, main="HANGUP CALLS", beside=TRUE,
col=terrain.colors(7),ylim = c(0,25),xlab = "DAYS - Monday through Sunday",ylab = "Number of Calls",col.lab="Blue",col.main="Red")
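The hand-copied count vectors are easy to get out of order, since `summary()` prints days alphabetically. Deriving the counts directly from the data, with factor levels fixing the Monday-to-Sunday order, avoids that. A sketch on toy day values:

```r
day_levels <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")

# Toy day-of-week values standing in for subsetcalls1$days
toy_days <- factor(c("Mon", "Wed", "Mon", "Sun", "Fri"), levels = day_levels)

counts <- table(toy_days)  # zero-filled, already in Mon..Sun order
barplot(counts, main = "NO VOICE CALLS", xlab = "Day",
        ylab = "Number of NO_VOICE Calls")
```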
#ggplot
#using library ggplot, install the ggplot2 package
library(ggplot2)
#reading the consolidated.csv file, which holds the number of calls for each weekend/weekday category
q=read.csv("consolidated.csv")
#ggplot
ggplot(q,aes(x=variable,y=count,fill=factor(index)))+
geom_bar(stat="identity",position="dodge")+
scale_fill_discrete(name="Legend",
breaks=c(1, 2,3,4),
labels=c("Weekend(Fri Included)", "Weekday(Fri excluded)","Weekend(Friday Excluded)","Weekday(Friday Included)"))+
xlab("Type of Calls")+ylab("Number of Calls")
Question 3: Find the top 5 dangerous reasons for calls (e.g. Aggravated assault). Plot them based on the locations.
From which location (street) did most of these calls originate? Can we find the most dangerous locality in Baltimore based on the 911 calls?
Is there a pattern for each of these calls with respect to their proximity to the coast? Do these calls increase in frequency when you get closer to the coast?
Data Cleaning
# Downloaded a sample of our dataset having callDateTime values starting from 01 January 2016 up to 10th Dec 2016
# Removed all rows having location values as (,) and (0.000000,0.000000) by filtering the data in Excel
# Split the location column values into two columns Latitude and Longitude in Excel using the 'Text to Columns' option
# Importing the dataset obtained after performing the above cleaning operations
subsetcalls = read.csv("Calls_for_Service.csv")
#cleaning on description column
#importing library sqldf - so that sql commands can be used in R
library(sqldf)
#selecting all distinct types of call descriptions and storing into a table
distinctDescriptionValues<-sqldf('select distinct description,count(description) from subsetcalls group by description')
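The shortlisting done by eye in Excel can also be done programmatically: tabulate the descriptions, sort descending, and take the head. A sketch on toy description values:

```r
# Toy description values standing in for subsetcalls$description
toy_desc <- c("AGGRAV ASSAULT", "ROBBERY ARMED", "AGGRAV ASSAULT",
              "SHOOTING", "AGGRAV ASSAULT", "ROBBERY ARMED")

# Top descriptions by call count (head(n) trims to however many are wanted)
top_desc <- head(sort(table(toy_desc), decreasing = TRUE), 5)
```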
# Selected the top dangerous reasons for 911 calls as 'Aggravated Assault','Destruction of property','Discharge Firearm/Shooting' and 'Robbery'
# Grouped similar categories having considerable count of rows and gave them a common label in Excel. Used the filter option for selecting descriptions
# For each selected reason, I found different variations of the same description label (different cases and spellings) and created a common label
# AGGRAV ASSAULT,Aggrav Assault -> Aggravated Assault
# DESTRUCT PROP,DESTRUCT PROPTY,Destruct Propty -> Destruction of property
# DISCHRG FIREARM,Dischrg Firearm,SHOOTING,Shooting -> Discharge Firearm & Shooting
# ROBBERY ARMED,Robbery Armed,ROBBERY UNARMED,Robbery Unarmed -> Robbery
# Load the cleaned Excel file having only the records belonging to the four descriptions mentioned above
dangerouscalls = read.csv("dangerous_calls.csv")
Plot the data
# Creating a map of Baltimore and plotting the locations of the most dangerous 911 calls as shortlisted above
# get_map is a smart wrapper that queries the Google Maps, OpenStreetMap, Stamen Maps or Naver Map servers for a map.
# It returns a ggmap object.
# The source attribute tells us the type of map to be used. By default, it uses Google Maps.
# geom_point is used to plot points on the map. It is generally used to create scatter plots
# scale_color_manual is used to map color values
# ggtitle For giving a title to the plot
library(ggmap)
ggmap(get_map(location = "baltimore",zoom=12)) +
geom_point(data=dangerouscalls,aes(x=Longitude,y=Latitude,colour=description)) +
scale_color_manual(values=c("red","blue","orange","black")) + ggtitle("Top dangerous reasons for 911 calls in Baltimore County")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=baltimore&zoom=12&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=baltimore&sensor=false
# Creating dataframes for each description and plotting them individually on the map
# Discharge Firearm & Shooting
dischgfirearm = subset(dangerouscalls,dangerouscalls$description=="Discharge Firearm & Shooting")
ggmap(get_map(location = "baltimore",zoom=12)) + geom_point(data=dischgfirearm,aes(x=Longitude,y=Latitude),color="orange") + ggtitle("Calls related to Discharge Firearm/Shooting in 2016")
# Aggravated Assault
aggassault = subset(dangerouscalls,dangerouscalls$description=="Aggravated Assault")
ggmap(get_map(location = "baltimore",zoom=12)) + geom_point(data=aggassault,aes(x=Longitude,y=Latitude),color="red") + ggtitle("Calls related to Aggravated Assaults in 2016")
# Destruction of property
destructprop = subset(dangerouscalls,dangerouscalls$description=="Destruction of property")
ggmap(get_map(location = "baltimore",zoom=12)) + geom_point(data=destructprop,aes(x=Longitude,y=Latitude),color="blue") + ggtitle("Calls related to Destruction of property in 2016")
# Robbery
robbery = subset(dangerouscalls,dangerouscalls$description=="Robbery")
ggmap(get_map(location = "baltimore",zoom=12)) + geom_point(data=robbery,aes(x=Longitude,y=Latitude),color="black") + ggtitle("Calls related to Robbery in 2016")
# This gives the district-wise counts of the four shortlisted dangerous call descriptions
summary(dangerouscalls$district)
## CD CW ED EVT2 ND NE NW SD SE SS SW TRU WD
## 3408 22 3126 1 3100 4627 3128 3889 3760 11 3336 814 2922
# Creating a vector for the districts count
districts=c(4627,3889,3760,3408,3336,3128,3126,3100,2922)
# Plotting a bar plot which shows the districtwise distribution of dangerous 911 calls
barplot(districts, main="Districtwise distribution of 911 calls related to dangerous crimes", beside=TRUE,
col=terrain.colors(9),ylim = c(0,5500),names.arg = c("NE","SD","SE","CD","SW","NW","ED","ND","WD"),
xlab = "District",ylab = "Number of 911 Calls",col.lab="Blue",col.main="Red")
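As with the daily counts, the district vector can be computed rather than transcribed from the `summary()` output, and `names()` then labels the bars directly. A sketch on toy district values:

```r
# Toy district values standing in for dangerouscalls$district
toy_district <- c("NE", "NE", "SD", "CD", "NE", "SD")

district_counts <- sort(table(toy_district), decreasing = TRUE)
barplot(district_counts,
        main = "Districtwise distribution of dangerous 911 calls",
        xlab = "District", ylab = "Number of 911 Calls")
```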
This data set contains a list of all the 911 emergency calls made in the city of Baltimore from the year 2015, and is updated every day. This data is gathered by ‘Open Baltimore’ and the city of Baltimore, and contains almost 2 million records!
The data contains information such as the time of call, latitude/longitude, district, description, and priority. We downloaded the data, and used Excel and R to analyze the data and produce plots. We came up with 3 questions that could possibly be answered using the data.
Question 1: Are more 911 calls related to burglary made during the night? Or is that just a preconceived notion?
The target audience here is Baltimore’s police department (PD). This analysis could help the police department decide if they need to be more alert for burglaries and increase their patrol duty during the night. After processing the data, we were able to get a count of all the burglary related calls for each time of the day: morning, afternoon, evening, and night.
This plot shows that the number of burglary-related calls in the afternoon and evening is almost double the number in the morning and at night. The graph suggests that the idea that more burglaries occur at night is a preconceived notion, and that more burglaries occur during the afternoon and evening hours (11AM-11PM), when people are outside and their homes are empty. Burglars target houses when they are empty rather than when it is dark outside. Hence, the PD need not increase its patrol duty during the night. This analysis does have limitations: the data covers only burglaries reported through 911 calls; it does not include theft reports made directly to the police department.
Question 2: Does the number of blank calls increase or decrease with the arrival of the weekend? To answer this question, I need to use the columns ‘callDateTime’ and ‘description’. The target audience for this analysis is the Baltimore PD. If there are more blank calls during the weekend, the PD can infer that most of them are prank calls made by partygoers.
The graph shows that the number of blank calls does not change across the week. Hence the PD cannot make assumptions about blank calls made during the weekend based on this analysis.
Question 3: Find the top 5 dangerous reasons for calls. Plot them based on the locations.
Can we find the most dangerous locality in Baltimore based on the 911 calls?
Is there a pattern in these calls with respect to their proximity to the coast?
The target audience for this analysis is prospective home buyers in Baltimore County, who could avoid buying a house in the most dangerous locality.
(Map of Baltimore with the dangerous-call locations; 203 calls with missing coordinates were omitted from the plot.)
According to the graph, the center of Baltimore city has the most emergency calls, the farther away from the city, the less the number of calls. Furthermore, there is no pattern of calls with respect to the coast. Hence prospective home buyers can look for homes away from the center of the city, and need not consider proximity to the coast.
Our project has three different research questions. We divided the questions among us so that each person could concentrate on analyzing and answering one question. In addition to working on our own questions, we divided the task of preparing each of the three draft documents by aggregating information from the other two team members. Divya prepared the Data Cleaning Draft and the first question. Akash prepared the R Implementation Draft and the third question. Samarjith prepared the R Plot Draft and the second question. For the final Git package, all of us worked on our individual components. Divya aggregated the components and prepared the final zip package for uploading to Git. Akash uploaded the zip package to Git. Akash and Samarjith worked on the presentation with inputs from Divya.